high quality
Segment Anything in High Quality
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask.
Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory
Li, Xingyao, Zhang, Fengzhuo, Pan, Jiachun, Hou, Yunlong, Tan, Vincent Y. F., Yang, Zhuoran
Despite the considerable progress achieved in the long video generation problem, there is still significant room to improve the consistency of the videos, particularly in terms of smoothness and transitions between scenes. We address these issues to enhance the consistency and coherence of videos generated with either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which meticulously edits the attention score matrix based on the Discrete Short-Time Fourier Transform. Our method is supported by a theoretical guarantee, the first-of-its-kind for frequency-based methods in diffusion models. For videos generated by multiple prompts, we further investigate key factors affecting prompt interpolation quality and propose PromptBlend, an advanced prompt interpolation pipeline. The efficacy of our proposed method is validated via extensive experimental results, exhibiting consistent and impressive improvements over baseline methods. The code will be released upon acceptance.
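The abstract states that TiARA builds on the Discrete Short-Time Fourier Transform. As a minimal sketch of what that transform computes on a 1-D sequence (not TiARA's attention reweighting itself, whose details are not given here), the following pure-Python function slides a Hann window over a signal and takes a DFT of each windowed segment. The function name `stft`, the Hann window, and the parameters `window_len` and `hop` are illustrative assumptions.

```python
import cmath
import math


def stft(signal, window_len, hop):
    """Discrete short-time Fourier transform with a periodic Hann window.

    Returns a list of frames. Each frame is a list of complex DFT
    coefficients for frequency bins 0 .. window_len // 2.
    Illustrative only: this is the generic transform, not TiARA.
    """
    # Periodic Hann window (an assumed choice; other windows work too).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / window_len)
              for n in range(window_len)]
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        # Window the current segment of the signal.
        segment = [signal[start + n] * window[n] for n in range(window_len)]
        # Direct DFT of the windowed segment, non-negative frequencies only.
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window_len)
                for n in range(window_len))
            for k in range(window_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# Usage: a cosine that completes 4 cycles per 16-sample window.
signal = [math.cos(2 * math.pi * 4 * n / 16) for n in range(64)]
frames = stft(signal, window_len=16, hop=8)
magnitudes = [[abs(c) for c in frame] for frame in frames]
```

Each frame localizes frequency content in time, which is the property a time-frequency method can exploit when deciding how to treat different temporal positions.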
EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation
Wang, Yongxin, Cao, Meng, Lin, Haokun, Han, Mingfei, Ma, Liang, Jiang, Jin, Cheng, Yuhao, Liang, Xiaodan
Multimodal large language models (MLLMs) have achieved remarkable progress on various visual question answering and reasoning tasks by leveraging instruction fine-tuning on specific datasets. They can also learn from human-annotated preference data to enhance their reasoning ability and mitigate hallucinations. Most preference data is generated by the model itself. However, existing methods require high-quality critical labels, which are costly and rely on humans or proprietary models like GPT-4V. In this work, we propose Enhancing Alignment in MLLMs via Critical Observation (EACO), which economically aligns MLLMs with self-generated preference data using only 5k images. Our approach begins with collecting and refining a Scoring Evaluation Instruction-tuning dataset to train a critical evaluation model, termed the Critic. This Critic observes model responses across multiple dimensions, selecting preferred and non-preferred outputs for refined Direct Preference Optimization (DPO) tuning. To further enhance model performance, we employ an additional supervised fine-tuning stage after preference tuning. EACO reduces overall hallucinations by 65.6% on HallusionBench and improves reasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement over LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also reveals the potential critical ability of open-source MLLMs, demonstrating that EACO is a viable path to boost the competence of MLLMs.
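The selection step described above, where a Critic scores responses and preferred and non-preferred outputs are chosen for DPO, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_preference_pairs`, the single scalar score per response, and the `min_gap` threshold are all hypothetical.

```python
def build_preference_pairs(scored_responses, min_gap=1.0):
    """Turn critic-scored responses into (chosen, rejected) pairs.

    scored_responses maps each prompt to a list of (response, score)
    tuples. The scalar score and the min_gap filter are assumptions
    made for illustration; the source does not specify them.
    """
    pairs = []
    for prompt, candidates in scored_responses.items():
        if len(candidates) < 2:
            continue  # a pair needs at least two responses
        best = max(candidates, key=lambda c: c[1])
        worst = min(candidates, key=lambda c: c[1])
        # Keep only pairs whose scores are clearly separated.
        if best[1] - worst[1] >= min_gap:
            pairs.append({
                "prompt": prompt,
                "chosen": best[0],
                "rejected": worst[0],
            })
    return pairs


# Usage with made-up critic scores.
scores = {
    "What is in the image?": [("A red car.", 4.5), ("A blue boat.", 1.0)],
    "How many dogs?": [("Two.", 3.0), ("Two dogs.", 3.2)],
}
pairs = build_preference_pairs(scores)
```

Records in this prompt/chosen/rejected shape are the usual input to preference-optimization training; filtering out near-ties is one common way to reduce label noise.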
CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models
Wang, Liangdong, Zhang, Bo-Wen, Wu, Chengwei, Zhao, Hanyu, Shi, Xiaofeng, Gu, Shuhao, Li, Jijie, Ma, Quanyue, Pan, TengFei, Liu, Guang
The success of Large Language Models (LLMs) [1][2] is primarily attributed to the availability of extensive, high-quality pre-training corpora, which underpin their foundational knowledge and reasoning capabilities for a variety of tasks, from creative writing to complex problem-solving. Among these corpora, open-source datasets such as The Pile [3] and Common Crawl [4] have been instrumental in propelling LLM development, fostering collaboration and establishing benchmarks for innovation. Existing research focuses more on scaling high-quality data. Recently, the demand for pre-training data has exceeded 10 trillion tokens [1][5][6], underscoring two key trajectories in English pre-training: scaling data and improving its quality. Open-source datasets have rapidly expanded, evolving from collections like The Pile (825 GB) to larger datasets such as FineWeb (15 TB) [7], which draw extensively from Common Crawl. Simultaneously, the focus has shifted from rule-based filtering methods, as seen in early projects like RedPajama [8], to model-driven approaches exemplified by FineWeb-Edu [7]. Despite the rapid advancement of English open-source datasets, Chinese data remains significantly underrepresented on the global web. Existing open-source Chinese datasets, such as WuDao [9], SkyPile150B [10], and WanjuanV1 [11], are constrained in scale due to a scarcity of Chinese data sources online. Furthermore, there is limited research focused on improving quality classification for Chinese web data, resulting in suboptimal data quality.
Follow-Up Questions Improve Documents Generated by Large Language Models
This study investigates the impact of Large Language Models (LLMs) generating follow-up questions in response to user requests for short (1-page) text documents. Users provided prompts requesting documents they would like the AI to produce. The AI then generated questions to clarify the user's needs before generating the requested documents. Users answered the questions and then indicated their preference between a document generated using both the initial prompt and the questions and answers, and a document generated using only the initial prompt, and gave feedback about their experience with the question-answering process.
FIFO-Diffusion: Generating Infinite Videos from Text without Training
Kim, Jihwan, Kang, Junoh, Choi, Jinyoung, Han, Bohyung
We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword: the frames near the tail can take advantage of cleaner ones by forward reference, but such a strategy induces a discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. In practice, given a baseline model, FIFO-Diffusion consumes a constant amount of memory regardless of the target video length and is well suited for parallel inference on multiple GPUs. We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. Generated video samples and source code are available at our project page.
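The queue mechanics of diagonal denoising can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: `denoise_one_step` is a hypothetical stand-in for a pretrained model, frames are single floats rather than latents, the queue is simply initialized with random noise, and latent partitioning and lookahead denoising are omitted.

```python
import random
from collections import deque


def denoise_one_step(value, noise_level):
    """Hypothetical stand-in for one denoising step of a pretrained model.

    Toy rule: scale the value so it reaches exactly 0 at noise level 0.
    """
    return value * (noise_level - 1) / noise_level


def fifo_generate(num_frames, queue_len, seed=0):
    """Toy diagonal denoising over a fixed-length queue of frames."""
    rng = random.Random(seed)
    # Each entry is [value, noise_level]; the noise level increases from
    # the head (1) to the tail (queue_len). Initialization is an assumption.
    queue = deque([rng.gauss(0, 1), level]
                  for level in range(1, queue_len + 1))
    outputs = []
    while len(outputs) < num_frames:
        # One step for every frame in the queue, each at its own level.
        for entry in queue:
            entry[0] = denoise_one_step(entry[0], entry[1])
            entry[1] -= 1
        # The head is now fully denoised (noise level 0): dequeue it.
        value, _ = queue.popleft()
        outputs.append(value)
        # Enqueue a fresh random-noise frame at the tail.
        queue.append([rng.gauss(0, 1), queue_len])
    return outputs, len(queue)


# Usage: produce 10 frames with a queue of 4.
frames, final_queue_len = fifo_generate(num_frames=10, queue_len=4)
```

The queue length never changes during the loop, which mirrors the claim that memory use stays constant regardless of how many frames are generated.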
Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
Luo, Yi, Lin, Zhenghao, Zhang, Yuhao, Sun, Jiashuo, Lin, Chen, Xu, Chengjin, Su, Xiangdong, Shen, Yelong, Guo, Jian, Gong, Yeyun
Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One current alignment technique is principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive library of guidelines and a model for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with relevant guidelines, which guide LLMs in response generation to ensure safe and high-quality outputs, thereby aligning with human values. An additional optional stage involves fine-tuning a model with well-aligned datasets generated through the process implemented in the second stage. Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.
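The second stage, in which new inputs are matched to relevant guidelines that then steer generation, can be sketched as follows. This is a rough illustration only: it swaps in plain word-overlap (Jaccard) scoring for the trained retrieval model, and the function names, the library layout, and the prompt template are hypothetical.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately crude choice)."""
    return set(text.lower().split())


def retrieve_guidelines(user_input, library, top_k=1):
    """Return up to top_k guidelines whose example input overlaps the query.

    library is a list of (example_input, guideline) pairs. Jaccard overlap
    stands in for the trained retrieval model described in the abstract.
    """
    query = tokenize(user_input)
    scored = []
    for example_input, guideline in library:
        tokens = tokenize(example_input)
        union = query | tokens
        score = len(query & tokens) / len(union) if union else 0.0
        scored.append((score, guideline))
    scored.sort(key=lambda item: item[0], reverse=True)
    # Drop entries with no overlap at all.
    return [g for s, g in scored[:top_k] if s > 0]


def build_prompt(user_input, guidelines):
    """Prepend retrieved guidelines to the input (hypothetical template)."""
    header = "\n".join(f"Guideline: {g}" for g in guidelines)
    return f"{header}\nUser: {user_input}" if header else f"User: {user_input}"


# Usage with a two-entry illustrative library.
library = [
    ("find a person's home address",
     "Do not reveal private personal information."),
    ("write a joke about a nationality",
     "Avoid stereotypes and biased content."),
]
query = "what is the home address of my neighbor"
prompt = build_prompt(query, retrieve_guidelines(query, library))
```

Keeping retrieval separate from generation means the guideline library can grow without retraining the response model, which is the lightweight property the abstract emphasizes.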